Skip to content

Conversation

@abhijeet-dhumal
Copy link
Contributor

@abhijeet-dhumal abhijeet-dhumal commented Sep 12, 2025

What this PR does / why we need it:

Fixes #87, #92, #116

Sample import styles for Options :

# Style 1: Only works with Kubernetes + common import options
from kubeflow.trainer import (
    Labels, Annotations, SpecLabels, SpecAnnotations,
    Name, PodTemplateOverrides, TrainerImage, TrainerCommand, TrainerArgs,
    PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride
)

# Style 2: Import all options - SDK validates at runtime and rejects incompatible ones
from kubeflow.trainer.options import (
    # Kubernetes options
    Labels, Annotations, SpecLabels, SpecAnnotations, Name,
    PodTemplateOverrides, TrainerImage, TrainerCommand, TrainerArgs,
    # Common options (used in PodTemplateOverrides)
    PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride,
    # LocalProcess options
    ProcessTimeout, WorkingDirectory
)

# Style 3: Backend-specific imports
from kubeflow.trainer.options.kubernetes import (
    Labels, Annotations, SpecLabels, SpecAnnotations, Name,
    PodTemplateOverrides, TrainerImage, TrainerCommand, TrainerArgs
)
from kubeflow.trainer.options.common import (
    PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride
)
from kubeflow.trainer.options.localprocess import (
    ProcessTimeout, WorkingDirectory
)

Job Metadata

# TrainJob resource metadata (for TrainJob level Labels and annotations)
Labels(labels={"team": "ml-platform", "project": "llm"})
Annotations(annotations={"description": "Fine-tuning job", "owner": "user"})
Name(name="my-custom-job-name")

# Derivative JobSet and Jobs labels (for Pod level Labels and annotations)
SpecLabels(labels={"app": "training", "version": "v1.0", "env": "prod"})
SpecAnnotations(annotations={"prometheus.io/scrape": "true", "cost-center": "ai"})

Container runtime customisation :

TrainerImage(image="custom/pytorch:2.0-cuda11.8")
TrainerCommand(command=["python", "train.py"])
TrainerArgs(args=["--epochs", "100", "--lr", "0.001", "--batch-size", "32"])

Pod level overrides:

PodTemplateOverrides(pod_template_overrides=[
    PodTemplateOverride(
        target_jobs=["node", "launcher"],
        metadata={  # Optional: pod-level labels/annotations
            "labels": {"tier": "training"},
            "annotations": {"sidecar.istio.io/inject": "false"}
        },
        spec=PodTemplateSpecOverride(
            service_account_name="training-sa",
            node_selector={"gpu": "true", "zone": "us-west1"},
            affinity={
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [{
                            "matchExpressions": [{
                                "key": "gpu-type",
                                "operator": "In",
                                "values": ["nvidia-a100"]
                            }]
                        }]
                    }
                }
            },
            tolerations=[
                {"key": "gpu", "operator": "Exists", "effect": "NoSchedule"},
                {"key": "high-priority", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
            ],
            volumes=[
                {"name": "data", "emptyDir": {}},
                {"name": "model", "persistentVolumeClaim": {"claimName": "model-pvc"}}
            ],
            containers=[
                ContainerOverride(
                    name="node",
                    env=[
                        {"name": "NCCL_DEBUG", "value": "INFO"},
                        {"name": "WANDB_API_KEY", "valueFrom": {"secretKeyRef": {
                            "name": "wandb-secret", "key": "api-key"
                        }}}
                    ],
                    volume_mounts=[
                        {"name": "data", "mountPath": "/data"},
                        {"name": "model", "mountPath": "/model"}
                    ]
                )
            ],
            init_containers=[
                ContainerOverride(
                    name="init-setup",
                    env=[{"name": "SETUP_MODE", "value": "fast"}],
                    volume_mounts=[{"name": "data", "mountPath": "/setup"}]
                )
            ],
            scheduling_gates=[{"name": "wait-for-quota"}],
            image_pull_secrets=[{"name": "registry-credentials"}]
        )
    )
])

Local process backend options:

from kubeflow.trainer.options import ProcessTimeout, WorkingDirectory
ProcessTimeout(timeout_seconds=3600)
WorkingDirectory(working_dir="/custom/training/path")

Complete example :

job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=lambda: print("Training LLM!"),
        num_nodes=2,
        resources_per_node={
            "cpu": "8",
            "memory": "32Gi",
            "nvidia.com/gpu": "2"
        },
        packages_to_install=["torch==2.0.0", "transformers", "datasets"],
        env={"WANDB_PROJECT": "llm-training"}
    ),
    options=[
        # TrainJob resource metadata
        Name(name="llm-finetune-001"),
        Labels(labels={"owner": "ml-team", "project": "llm"}),
        Annotations(annotations={"description": "GPT fine-tuning experiment"}),
        
        # Derivative JobSet/Jobs labels (for training pods)
        SpecLabels(labels={"app": "training", "version": "v1.0", "env": "prod"}),
        SpecAnnotations(annotations={"prometheus.io/scrape": "true"}),
        
        # Trainer container customization
        TrainerImage(image="my-registry/pytorch:2.0-cuda11.8"),
        
        # Pod-level customizations
        PodTemplateOverrides(pod_template_overrides=[
            PodTemplateOverride(
                target_jobs=["node"],
                spec=PodTemplateSpecOverride(
                    service_account_name="gpu-training-sa",
                    node_selector={"accelerator": "nvidia-a100", "zone": "us-west1"},
                    tolerations=[
                        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
                    ],
                    volumes=[
                        {"name": "data", "persistentVolumeClaim": {"claimName": "training-data-pvc"}},
                        {"name": "cache", "emptyDir": {"sizeLimit": "50Gi"}}
                    ],
                    containers=[
                        ContainerOverride(
                            name="node",
                            env=[
                                {"name": "NCCL_DEBUG", "value": "INFO"},
                                {"name": "NCCL_IB_DISABLE", "value": "1"},
                                {"name": "WANDB_API_KEY", "valueFrom": {
                                    "secretKeyRef": {"name": "wandb-creds", "key": "api-key"}
                                }}
                            ],
                            volume_mounts=[
                                {"name": "data", "mountPath": "/data"},
                                {"name": "cache", "mountPath": "/cache"}
                            ]
                        )
                    ]
                )
            )
        ])
    ]
)

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhijeet-dhumal abhijeet-dhumal changed the title Add labels and annotations support for train client feast: Add labels and annotations support for train client Sep 12, 2025
@abhijeet-dhumal abhijeet-dhumal changed the title feast: Add labels and annotations support for train client feat: Add labels and annotations support for train client Sep 15, 2025
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 67d12d8 to b39b364 Compare September 15, 2025 05:22
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch 2 times, most recently from 2d7a8e6 to dbba135 Compare September 15, 2025 05:44
@abhijeet-dhumal abhijeet-dhumal changed the title feat: Add labels and annotations support for train client feat: Implement Training Options pattern with WithLabels, WithAnnotations, and WithPodSpecOverrides for flexible TrainJob customization Sep 15, 2025
@abhijeet-dhumal abhijeet-dhumal changed the title feat: Implement Training Options pattern with WithLabels, WithAnnotations, and WithPodSpecOverrides for flexible TrainJob customization feat: Implement Training Options pattern for flexible TrainJob customization Sep 15, 2025
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 36c0160 to 95155f6 Compare September 15, 2025 10:41
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review September 15, 2025 11:23
Copy link
Member

@szaher szaher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @abhijeet-dhumal

I believe we need to handle how options will be applied for while backends. either we ignore options for localprocess or make options targeted towards specific backend.

@abhijeet-dhumal abhijeet-dhumal marked this pull request as draft September 17, 2025 13:17
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch 3 times, most recently from f092845 to 03994e7 Compare September 24, 2025 11:40
@coveralls
Copy link

coveralls commented Sep 24, 2025

Pull Request Test Coverage Report for Build 18750401322

Details

  • 37 of 37 (100.0%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+2.7%) to 76.11%

Files with Coverage Reduction New Missed Lines %
kubeflow/trainer/utils/utils.py 1 70.09%
Totals Coverage Status
Change from base Build 18655221655: 2.7%
Covered Lines: 360
Relevant Lines: 473

💛 - Coveralls

@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch 3 times, most recently from 6908d56 to 9f4098c Compare October 8, 2025 15:41
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch 3 times, most recently from 0687968 to 3b7f688 Compare October 8, 2025 17:10
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review October 8, 2025 17:11
metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
name=train_job_name, labels=labels, annotations=annotations
),
spec=models.TrainerV1alpha1TrainJobSpec(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @abhijeet-dhumal,
I was looking for workarounds to overcome some of the Kubeflow SDK limitations (like specifying PVCs for particular training jobs, etc..) and came across your PR, which seems to fix those limitations. I cloned your fork and found some issues with this specific part.
It looks like KubernetesBackend.train() never forwards the spec_section.get("podSpecOverrides") into the TrainerV1alpha1TrainJob we send to the API. As a result, any per-job pod settings—PVC mounts, tolerations, node selectors—are silently dropped. The podSpecOverrides argument is not used in models.TrainerV1alpha1TrainJobSpec at all. Let me know what you think.

Thanks for your amazing work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhijeet-dhumal could it be due to the new field being podTemplateOverrides now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bohdandenysenko Thanks for catching this bug! You're absolutely right—the podSpecOverrides were being extracted from options but it wasn't used in TrainJob spec.
I've just pushed some fixes, Perhaps those might help!

@astefanutti Good catch — I see in Trainer the warnings ment for PodSpecOverrides deprecation, let me check.. If the upstream API has changed to podTemplateOverrides, I should update both the option class and the forwarding logic to match with the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bohdandenysenko All tests are passing now. Please check and let me know if you see any issues with the fix!

Copy link
Contributor

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhijeet-dhumal Thanks for this amazing work!

This looks good to me overall. Could you please change to use PodTemplateOverride and address the few open comments.

/assign @kubeflow/kubeflow-sdk-team @briangallagher

metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
name=train_job_name, labels=labels, annotations=annotations
),
spec=models.TrainerV1alpha1TrainJobSpec(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhijeet-dhumal could it be due to the new field being podTemplateOverrides now?


# Import common training options (defaults to Kubernetes backend)
from kubeflow.trainer.backends.kubernetes.options import (
PodSpecOverride,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that be useful to have a dedicated kubeflow.trainer.options package?

That may be a way to consider getting rid of the With prefix, @kubeflow/kubeflow-sdk-team WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you remove the With prefix @astefanutti ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich removing With from the type names below, like PodSpecOverrides instead of WithPodSpecOverrides.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, +1 on both the dedicated options package and removing the With prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @astefanutti @andreyvelich
I have added Options as a dedicated package now,
Where imports will look like this for the SDK user flow :

# Style 1: only works for Kubernetes options + common options, not LocalProcess options.
from kubeflow.trainer import (
    Labels, Annotations, PodTemplateOverrides,
    PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride
)

# Style 2: the SDK validates options at runtime for respective backend types and rejects incompatible ones.
from kubeflow.trainer.options import (
    Labels, Annotations, PodTemplateOverrides,
    PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride,
    ProcessTimeout, WorkingDirectory  # LocalProcess options also available
)

# Style 3: Backend-specific imports
from kubeflow.trainer.options.kubernetes import (
    Labels, Annotations, PodTemplateOverrides,
    TrainerImage, TrainerCommand, TrainerArgs
)
from kubeflow.trainer.options.common import (
    PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride
)
from kubeflow.trainer.options.localprocess import (
    ProcessTimeout, WorkingDirectory
)

Is this seems correct approach here ?

@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 30b5efc to 1a08bdd Compare October 23, 2025 12:24
…podSpecOverride

Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
…alProcessCompatible)

Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
…nd Option protocol

Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
…ain()

Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
…tainer overrides

Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch 3 times, most recently from 8ea01ac to a303e7d Compare October 23, 2025 12:32
Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
…bSet/Jobs labels

Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
Copy link
Contributor

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, otherwise looks good to me.

trainer: Optional configuration for a CustomTrainer or BuiltinTrainer. If not specified,
the TrainJob will use the runtime's default values.
options: Optional list of configuration options to apply to the TrainJob. Use
WithLabels and WithAnnotations for basic metadata configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
WithLabels and WithAnnotations for basic metadata configuration.
Labels and Annotations for basic metadata configuration.

def __call__(
self,
job_spec: dict,
trainer: Optional[Union["BuiltinTrainer", "CustomTrainer"]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot that be the actual types rather than strings?



@dataclass
class Name(KubernetesCompatible):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't that be in common?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose TrainJob labels and annotations in the SDK

6 participants